ABSTRACT

A key element of an information system is the
representation of the information items. Studies have
found that, when using precision and recall performance
measures, the differences among various representations
are not critical. Evidence does indicate that the actual
items retrieved vary significantly from representation to
representation. This study will determine the impact of
representation on the retrieval of information items in
terms of performance and overlap and suggest performance
limits for an information system, given a specific
representation.

This interim report describes Phase I of the project.
Seven representations were tested using a latin square
design on 84 queries. The INSPEC Computer and Control
Abstracts was the study data base loaded on the DIATOM
system. The data generally confirm the earlier observed
data: overlaps were again small. Plans for replication
and theory development in Phase II are described.
I. INTRODUCTION

This report presents the interim results of the
Document Representation study. The report will describe the
research background and objectives, procedures used during
the first phase of the study, results of the first phase,
and plans for the second phase. The document representation
study is designed to provide fundamental knowledge of the
effect of the representation of information items on
information system performance.

Past studies have found that, when using precision and
recall performance measures, the differences among various
representations are not critical. Studies to date have
examined the precision and recall performance of two or more
representations. The unifying element of these studies is a
search for a "better" representation. That is, given a
specified environment and using a particular set of queries,
which representation performs better in terms of precision
and recall? In these studies, no one representation clearly
outperforms others. But studies have shown that when using
a particular representation it is possible to employ
techniques to enhance the performance of that
representation.
This study takes as its departure evidence that
performance measures have masked real and systematic
differences among the representations. Specifically,
different representations result in the retrieval of
different items. Two previous studies support the
hypothesis.

The Ranking Project (MCGILL) examined the specific
items retrieved from each of the representations used in
that study. The same searcher using different
representations for the same information need statement had
an overlap of retrieved items totalling 14%. Different
searchers using different representations had an overlap of
the retrieved set of 5%. That is, this study found that
using the free representation or the controlled
representation did not affect performance measures, but it
did impact the actual items retrieved by the system. The
user can expect approximately the same number of relevant
documents using either representation; however, the actual
documents retrieved are not the same.
SMITH examined the combination of document
representation and similarity measure. Her work was
conducted using a subset of the INSPEC data base. Using the
representation of a document as a query, she examined seven
different representations. SMITH did not investigate
performance measures, but did report non-symmetric overlap.

Non-symmetric overlap was defined as

\[ \frac{n(A \cap B)}{n(B)} \quad \text{and} \quad \frac{n(A \cap B)}{n(A)} \]

The non-symmetric measure indicates the direction of
the overlap. Non-symmetric overlap measures among the
retrieved sets ranged from a mean overlap measure of .489
(or approximately 50% of the documents were in sets
retrieved by both representations) to a mean of .004 (or
only 0.4% of the documents were retrieved by both
representations).
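
The two ratios in the non-symmetric overlap definition can be
illustrated with a minimal sketch. The function name and the
document identifiers below are hypothetical and are used only
for illustration; they do not come from SMITH's study.

```python
def nonsymmetric_overlap(a, b):
    """Return (n(A∩B)/n(B), n(A∩B)/n(A)) for two retrieved sets.

    A ratio is reported as None when its denominator set is empty.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    over_b = common / len(b) if b else None
    over_a = common / len(a) if a else None
    return over_b, over_a


# Hypothetical retrieved sets for two representations.
set_a = {"d1", "d2", "d3", "d4"}
set_b = {"d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}

# Two documents are shared: 2/8 of B and 2/4 of A.
print(nonsymmetric_overlap(set_a, set_b))  # (0.25, 0.5)
```

The two values differ whenever the sets differ in size, which is
what lets the measure indicate the direction of the overlap.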



These studies indicate the potential importance of the
selection of representations of information items. However,
neither of the above studies is conclusive or generalizable.
This study is designed to build on the previous findings and
to ultimately develop a theoretical model accounting for
representation differences.
II. OBJECTIVES

The assessment of the various representations is
concerned with a number of specific objectives:

(1) To determine if the information items retrieved by
the differing representations are significantly and
substantially different.

(2) To assess the effectiveness of representations or
combinations of representations.

(3) To develop and test a theoretic model sufficient to
explain any differences in information retrieval system
operation based on changes in the representation of
information items.

At the conclusion of the study, an information
scientist should be able to discern the relative impact of a
particular representation. The data should indicate which
representations are redundant or may be used in place of
another, and which representations may be used in
combination to enhance a particular aspect of system
performance, such as recall. Finally, it may be possible to
specify upper bounds of particular performance measures
given a particular representation.
III. RETRIEVAL ENVIRONMENT

A. Data Base

Permission was granted by the Institution of Electrical
Engineers to use the Computer and Control Abstracts portion
of the INSPEC data base. Altogether 12,000 documents formed
the data base used in this study. These constituted the
September - December 1979 issues of Computer and Control
Abstracts. The choice of this data base and its size
provided enough topic specificity to ensure that a
reasonable number of documents would be retrieved in each
representation.

Each document consisted of a series of bibliographic
citation fields, an abstract, and some indexing information.
The format of each document record as it was printed upon
retrieval is as follows:

    DN number (abstract numbers from INSPEC journals)
    Title
    Authors (separated by commas)
    Source field as follows:
        Publication; (volume and issue number)
        p (part number) pagination data
        Following this may be information in [ ].
        This is information on the cover-to-cover
        translation as follows:
        [publication; (volume and issue) pages,
        date] (type of unconventional media)
        (availability) (title of conference)
        (location of conference) (sponsoring
        organization) (date) language
    Abstract
    Indexing information



B. Retrieval System

DIATOM, an on-line retrieval system which was designed
to simulate most of the features of DIALOG, was used to
conduct all the searches in this study. DIATOM was designed
and programmed by Bob Waldstein, a PhD student at the School
of Information Studies.

The major differences between DIATOM and DIALOG are
listed below.

    1. DIATOM permitted the searchers to log on directly
       to a particular representation. All search
       statements were subsequently restricted to that
       representation only.

    2. The system included a stemmer used for the stem
       representation.

    3. To restrict a search to a particular language,
       a Limit /ENG (for English) was used.

    4. Adjacency (nW) could not be used with either
       truncation or stemming.

    5. Adjacency at times ran very slowly; the field
       operator (F) could be used instead.
C. Search Intermediaries

A total of seven intermediaries were required for the
research design. All of the intermediaries used in the
study were professional librarians or information brokers
with experience using computerized retrieval systems; all
had had some experience using DIALOG.

All intermediaries took part in a one-day training
session. Afterwards, each intermediary was required to
familiarize himself with the system and make at least 14
searches to the data base. A copy of the training materials
furnished the intermediaries is provided in Appendix A.

D. Users and Queries

Originally the study specified 98 users, each of whom
was to provide a single interest statement or query.
However, because of difficulty in obtaining users, the study
was reduced to 84 queries. Users were solicited from the
Syracuse University community and institutions concerned
with information retrieval. Table 1 indicates
characteristics of the users. Our objective in accepting
users was to come as close as possible to criteria used in
operational search services so that queries and relevance
judgments could plausibly be generalized.
                              TABLE 1
                      Characteristics of Users

                     No. of                      Sci/            No. of
Affiliation          Users   Faculty  Students   Eng    Others   Queries

Syracuse U.            35      26        8        0        1       41
General Electric        1       0        0        1        0        4
Univ. of Illinois       5       2        3        0        0        5
Univ. of Louisville     9       0        0        0        9       14
National Bureau
  of Standards          6       0        0        6        0        6
OCLC, Inc.              5       0        0        5        0        6
Environmental
  Protection Agency     6       0        0        6        0        6
OTISCA Industries       1       0        0        0        1        1
SUNY College of
  Environ. Sciences
  & Forestry            1       0        1        0        0        1

Total                  69*     28       12       18       11       84

*Altogether, 69 individuals served as users in this study.
 11 of these individuals submitted more than one query:
 8 users submitted 2 queries, 2 users submitted 3 queries,
 and 1 user submitted 4 queries.
E. Relevance Judgments

Relevance judgments were obtained from the users for
all documents retrieved for the query.* A four-point scale
was used with "1" and "2" indicating relevant, "3" and "4"
indicating non-relevant. The instructions which accompanied
the search results are provided in Appendix B.

*After repeated attempts, four users did not return
their relevance judgments. In these few cases we identified
other individuals who presumably could make relevance
judgments in the specific topic area of the query. These
surrogate users made the relevance judgments.
IV. METHODOLOGY

A. Variables

The key experimental or independent variable was the
representation used for searching the data base. Seven
representations were chosen:

    TT - title only.

    AA - terms in abstract only.

    DD - descriptor terms only.

    II - identifier terms only.

    TA - terms in title and abstract only.

    ST - stemmed terms in title and abstract only.
         (The computer automatically takes the root
         of any entered term.)

    DI - terms in descriptor and identifier fields.

The major dependent variables were performance measures
(recall and precision) and measures of overlap. In
addition, a count of the total number of retrieved documents
was also analyzed. A more precise description of each of
the measures is given below.



RECALL. The recall ratios were formed by dividing the
number of relevant documents retrieved by each
representation by the total number of relevant documents
retrieved by all seven representations. Two versions of
recall were computed.

    Recall-1: defined a relevant document stringently.
    The user had to judge the document to be "most
    relevant", that is, rate it a "1" on the four
    point scale.

    Recall-2: defined a relevant document more broadly.
    The user could rate it either as a "1" or a "2" on
    the four point scale.

PRECISION. The precision ratio was formed by dividing the
number of relevant documents retrieved by each
representation by the total number of documents retrieved
by that representation. Two versions of precision were
computed.

    Precision-1: defined a relevant document stringently:
    a "1" on the four point scale.

    Precision-2: defined a relevant document more
    broadly: a "1" or a "2" on the four point scale.

TOTAL-RETRIEVED. This measure is simply the total number
of documents retrieved by each representation; it is the
denominator of the precision ratio. It was included
because it is an indication of user effort required to
read the output from the system.
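
As an illustration of the recall, precision, and
total-retrieved definitions above, the following is a minimal
sketch for a single query and a single representation. The
function name, argument names, and sample judgments are
assumptions made for this example, not part of the study's
software.

```python
def performance(retrieved, judgments, pooled, max_grade):
    """Compute (recall, precision, total_retrieved) for one representation.

    retrieved : set of document ids retrieved by the representation
    judgments : dict mapping document id -> rating on the 1..4 scale
    pooled    : set of ids retrieved by all seven representations together
    max_grade : 1 for the stringent version, 2 for the broader version
    """
    relevant_pool = {d for d in pooled if judgments[d] <= max_grade}
    relevant_hits = retrieved & relevant_pool
    # A ratio is undefined (None) when its denominator is zero.
    recall = len(relevant_hits) / len(relevant_pool) if relevant_pool else None
    precision = len(relevant_hits) / len(retrieved) if retrieved else None
    return recall, precision, len(retrieved)


# Hypothetical judgments for the six documents pooled for one query.
judgments = {"d1": 1, "d2": 2, "d3": 3, "d4": 1, "d5": 4, "d6": 2}
pooled = set(judgments)
retrieved = {"d1", "d2", "d3"}

print(performance(retrieved, judgments, pooled, max_grade=1))  # Recall-1, Precision-1
print(performance(retrieved, judgments, pooled, max_grade=2))  # Recall-2, Precision-2
```

Returning None for an undefined ratio keeps such cases visible so
that they can be set aside before averaging.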
SYMMETRIC-OVERLAP. For two representations, A and B, this
measure is computed by dividing the number of documents
retrieved in common by both representations by the total
number of documents retrieved by the two representations.
Or more formally, it is the number of retrieved documents
in the intersection of the two representations divided by
the number of retrieved documents in the union of the two
representations. Three versions of the symmetric-overlap
were computed.

    Symmetric-1: counted only highly (i.e. "1" on the
    four point scale) relevant documents retrieved.

    Symmetric-2: counted all (i.e. "1" or "2") relevant
    documents retrieved.

    Symmetric-all: counted all documents retrieved.

ASYMMETRIC-OVERLAP. For two representations, A and B,
this measure is computed by dividing the number of
documents retrieved by both representations by the number
of documents retrieved by one of the representations. A
smaller asymmetric overlap indicates a greater degree of
independence of one representation (in the denominator)
from the other representation. And, as is the case of
the symmetrical measure, there are three versions of this
measure: most relevant, all relevant, and all documents.

UNION-OVERLAP. For two representations, A and B, this
measure is computed by dividing the number of documents
retrieved by either of the representations by the number
of documents retrieved by all seven representations. It
is the number of retrieved documents in the union of the
two representations divided by the number retrieved in
the union of all seven representations. Thus, the union
overlap can be viewed as a recall ratio for a combination
of representations. This measure extends to more than
two representations and three versions of it can be
computed: most relevant, all relevant, and all documents
retrieved.
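
The three overlap measures defined above reduce to simple set
operations. The following is a minimal sketch; the function
names and the sample sets are hypothetical. The "most relevant"
and "all relevant" versions are assumed to be obtained by
filtering each set to the judged documents before calling these
functions.

```python
def symmetric_overlap(a, b):
    """Intersection of A and B divided by their union."""
    union = a | b
    return len(a & b) / len(union) if union else None


def asymmetric_overlap(a, b):
    """Intersection of A and B divided by the size of B (the denominator set)."""
    return len(a & b) / len(b) if b else None


def union_overlap(representations, pooled):
    """Union of the given representations divided by the pool of all seven."""
    combined = set().union(*representations)
    return len(combined) / len(pooled) if pooled else None


# Hypothetical retrieved sets; pooled stands for the union of all seven.
a = {1, 2, 3, 4}
b = {3, 4, 5}
pooled = set(range(1, 11))

print(symmetric_overlap(a, b))         # 2 shared / 5 in the union
print(asymmetric_overlap(a, b))        # 2 shared / 3 in B
print(asymmetric_overlap(b, a))        # 2 shared / 4 in A
print(union_overlap([a, b], pooled))   # 5 in the union / 10 pooled
```

Because union_overlap accepts a list, the same function covers
combinations of more than two representations.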

B. Procedure

Queries were obtained from users one at a time (see
Appendix C for the directions given users). The queries
were used as submitted; they were not screened for
appropriateness to the data base or for on-line searching.
Each of the seven searchers was given a photocopy of the
search request. For each query, each searcher received
instructions which specified the one representation that
searcher was to use for that query. Representations were
assigned to searchers on each query according to the latin
square design.

Thus, each of the 84 queries was searched under each of
the seven representations; in total, seven searches (each
using a separate representation) were carried out for each
of the 84 queries.
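
The assignment of representations to searchers can be sketched
as follows. This is only an illustration: the actual squares
used in the study are the ones given in Appendix E, and the
cyclic square, the function names, and the query numbering
below are assumptions made for the example.

```python
REPRESENTATIONS = ["TT", "AA", "DD", "II", "TA", "ST", "DI"]


def cyclic_latin_square(symbols):
    """Build an n x n latin square by cyclically shifting the symbol list."""
    n = len(symbols)
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]


def assignment_plan(n_replications=12):
    """Map each query number to {searcher number: representation}.

    Each replication is one 7 x 7 square: rows are queries and
    columns are searchers (an assumed layout).
    """
    square = cyclic_latin_square(REPRESENTATIONS)
    n = len(REPRESENTATIONS)
    plan = {}
    for replication in range(n_replications):
        for row in range(n):
            query = replication * n + row + 1
            plan[query] = {s + 1: square[row][s] for s in range(n)}
    return plan


plan = assignment_plan()
print(len(plan))   # 84 queries in total
print(plan[1])     # representation used by each of the seven searchers on query 1
```

Within each block of seven queries, every searcher uses every
representation exactly once, and every query is searched under
all seven representations.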

Searchers used DIATOM to retrieve documents. Searchers
were instructed to carry out a "high-recall" search,
retrieving a maximum of fifty documents. The directions
given to each intermediary are given in the appendices.

After all seven intermediaries completed a query, the
seven retrieved document sets were merged into a single
listing and placed in reverse accession number order. The
listing consisted of the citations and abstracts of all
retrieved documents. No clue was present which indicated
either the searcher or the representation.

Two copies of this listing were produced. Both copies
were sent to the user with instructions (see Appendix B) to
make relevance judgments on one copy and return that copy to
the project. The second copy was for the user.

C. Design and Analysis

The overall design can be characterized as a 7x7 latin
square replicated 12 times. The full design is given in
Appendix E.

The measures of recall, precision, and total-retrieved
are analyzed using standard analysis of variance
computations. The design and the analysis control for
extraneous variables and can identify separate effects for
representations, intermediaries, and, if desired,
replications. Approximately ten percent (66) of the
precision results had to be excluded from the analysis
because no documents were retrieved for a given query under
a given representation. Fourteen queries had to be excluded
from all Recall-1 analyses, and seven from the Recall-2
analysis, because in each situation no relevant documents
were retrieved.

The overlap measures may have been adversely affected
by the latin square design. Because each pair of
representations for a given query were searched by different
intermediaries, there is a possibility that the overlap
measures confound representations with intermediaries.
Keeping this concern in mind, we will compute and interpret
the results of the overlap analyses. The overall design
will be changed for the second phase of this study in order
to prevent this possibility.





Our initial concern was to determine if the results
from this study repeated the pattern noted earlier:
relatively little difference in performance among the
representations coupled with relatively little overlap.
Table 2 presents these results. It is apparent that these
results do repeat the pattern observed in other studies.

Though some performance measures are significantly
different, none of the differences exceed 18%, which is
clearly within the range of values reported in the
literature. The overlaps range from a low of about 6% to a
high of about 17%; these also correspond to the earlier
results.

The remaining part of this section presents these
findings in more detail. First the performance measures
will be considered. Then the study of overlaps will be
presented.



A. Analysis of Performance

Descriptive summary statistics for the five performance
measures are presented in Table 3. The means were tested
for statistically significant differences (see Appendix F
for the AOV Summary Tables). Representations differed
significantly in the Recall-1, Recall-2, and Total-Retrieved
scores. The bottom of Table 3 indicates that descriptors
(DD) and titles (TT) perform rather poorly as
representations on the recall measures, while identifiers
(II) and title-abstracts (either TA or ST) perform much
better.

                              TABLE 2

                Performance and Overlap Comparisons
        Between the "Best" and the "Worst" Representations

                   REC-1    REC-2    PRE-1    PRE-2    TOT-RET

"Best" Rep.        .404     .321     .264     .422     19.833
"Worst" Rep.       .229     .200     .173     .336     12.429
Difference         .175*    .121*    .091     .086      7.404*
Symmetric
overlap**          .155     .138     .172     .150       .057

*Difference is statistically significant.

**Symmetric overlap figures are taken from TABLE 5 using
  the pairwise overlap between the "Best" and "Worst"
  for each performance measure, e.g. the pairwise overlap
  for Relevant "1's" for TA ("Best") and DD ("Worst") is
  used for column 1.

, i , - 

Even though no pairs of representations differed
significantly in either precision measure, it is useful to
include some consideration of precision in these findings.
Considering all five measures, the descriptor (DD)
representation performs uniformly poorly on the recall and
precision measures while title-abstract (TA) performs
reasonably well on them, though not as strongly as DD's
negative performance. Interestingly, the free text words
assigned by indexers (II) perform moderately well over all
five measures. Stemming (ST), which would tend to increase
the total number retrieved, performs quite well on the recall
measures, but poorly on the precision measures. The title
representation (TT) shows the opposite pattern: high on
the precision measures (and Tot-Ret) and low for Recall.
The other representations fluctuate quite a bit over the
five measures.

The recall and precision means given in Table 3 are the
average of individual ratios; each query contributed
equally to the final average. Another way to compute the
average performance values is to compute the ratio last.
For example, for Recall-1, sum the number of relevant
documents retrieved from all 70 queries using a particular
representation and divide this total by the number of
                              TABLE 3

         Means and Standard Deviations by Representations**

Representation         REC-1     REC-2     PRE-1     PRE-2     TOT-RET

DD (descriptor)        0.229     0.200     0.173     0.336     13.238
                       (70)      (77)      (62)      (62)      (84)
                       .319      .257      .260      .330      15.824

AA (abstract)          0.365     0.270     0.197     0.352     17.488
                       (70)      (77)      (77)      (77)      (84)
                       .314      .241      .255      .315      16.850

TA (title and          0.404     0.290     0.224     0.352     18.583
abstract)              (70)      (77)      (78)      (78)      (84)
                       .317      .236      .286      .318      16.245

DI (descriptor         0.330  [illegible]  0.221     0.361     16.369
and identifier)        (70)      (77)      (75)      (75)      (84)
                       .328      .284      .270      .300      16.166

ST (stemmed title      0.392     0.317     0.188     0.338     19.833
and abstract)          (70)      (77)      (81)      (81)      (84)
                       .352      .263      .231      .291      15.814

TT (title)             0.273     0.205     0.264     0.422     12.429
                       (70)      (77)      (70)      (70)      (84)
                       .292      .207      .335      .370      13.744

II (identifier)        0.339     0.321     0.218     0.403     16.131
                       (70)      (77)      (79)      (79)      (84)
                       .323      .276      .282      .334      15.181

Minimum difference     0.133     0.106       -         -        5.450
between means that
are significantly
different at .05.*

Pairs of               DD<TA     DD<II     none      none      DD<ST
representations        DD<ST     DD<ST                         TT<ST
that differ            DD<AA     TT<II                         TT<TA
                                 TT<ST

*Using Tukey's HSD procedure. See Appendix F for details.

**The three values given in each cell of the table are
  respectively the mean, the sample size, and the
  standard deviation.
                              TABLE 4

               Mean Performance by Representation
                         Across Queries

Representation          REC-1     REC-2     PRE-1     PRE-2

DD (descriptor)         0.237     0.216     0.173     0.335
AA (abstract)           0.328     0.283     0.181     0.332
TA (title & abst)       0.369     0.294     0.192     0.324
DI (descr & ident)      0.309     0.268     0.182     0.336
ST (stemmed TA)         0.304     0.281     0.148     0.291
TT (title)              0.285     0.229     0.221     0.378
II (identifier)         0.348     0.306     0.208     0.389
relevant documents retrieved from all 70 queries using all
seven representations. This is a more conservative approach
and these values can never exceed the values presented in
Table 3. This approach is useful, however, because the
unique contribution of single (perhaps atypical) queries is
removed. The average values computed in this manner are
presented in Table 4. There are several parallels between
the patterns in the two tables. Again, the II
representation performs well on all four measures.
Descriptors (DD) still show an overall poor performance and
title-abstract (TA) performs well (though the similarity is
weakened in the precision measure). Titles (TT) have the
same pattern here as in Table 3, while stemming (ST) is not
quite as good in the recall measures and is just as poor in
the precision measures.
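
The two ways of averaging described above (averaging the
per-query ratios, versus computing the ratio last from summed
counts) can be compared with a minimal sketch. The function
names and the per-query counts are hypothetical and are not
taken from the study data.

```python
def average_of_ratios(per_query):
    """Mean of per-query ratios; each query contributes equally.

    per_query is a list of (numerator, denominator) counts, e.g.
    (relevant retrieved by one representation, relevant retrieved by all).
    Queries with a zero denominator are excluded.
    """
    ratios = [num / den for num, den in per_query if den > 0]
    return sum(ratios) / len(ratios)


def ratio_of_sums(per_query):
    """Sum the counts over all queries first, then form a single ratio."""
    usable = [(num, den) for num, den in per_query if den > 0]
    return sum(num for num, _ in usable) / sum(den for _, den in usable)


# Hypothetical counts for three queries.
counts = [(1, 2), (1, 1), (2, 10)]

print(round(average_of_ratios(counts), 3))  # 0.567
print(round(ratio_of_sums(counts), 3))      # 0.308
```

In this example the third query, which has many relevant
documents, weighs much more heavily when the ratio is computed
last.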



B. Analysis of Overlaps

The simplest analysis of overlaps is pairwise,
comparing each representation with every other
representation. Tables 5, 6, and 7 contain the pairwise
overlaps for symmetrical, asymmetrical, and union overlap.
Each table reports the overlap for relevant documents (only
those judged a "1", and those judged a "1" or a "2") and for
all documents.
As might be expected, the pairwise overlaps decrease as
the number of documents under consideration increases. That
is, the average overlap is highest when only most relevant
documents are included; it is lowest when all documents are
included.

The major finding in these data is that the overlaps
are quite small as indicated by the averages. This is true
even between representations that should have retrieved very
similar sets such as abstract (AA) and title-abstract (TA)
or descriptor (DD) and descriptor-identifier (DI). One
possible explanation for the size of the overlaps is
searcher differences. The analysis of variance tables (see
Appendix F) support this contention; they show that between
searcher differences account for one of the largest
portions of the variance. However, the data in the ranking
study (MCGILL) cast doubt on the contention that searchers
are the sole or major cause of the low amount of overlap.
In the ranking study, overlaps between different
representations searched by the same searcher only equalled
14% for retrieved documents. That figure certainly falls in
the range of values reported here.

Going beyond pairwise overlaps, the question arises as
to the optimum combination of representations, or more
precisely, the optimum ordering of representations. That



                              TABLE 5
                    Symmetric Pairwise Overlaps

        AA      TT      TA      ST      II      DI      DD      AVG

Version - Most Relevant

AA    1.000   0.181   0.270   0.313   0.212   0.217   0.125   .220
TT    0.181   1.000   0.227   0.178   0.236   0.209   0.172   .200
TA    0.270   0.227   1.000   0.307   0.208   0.236   0.155   .234
ST    0.313   0.178   0.307   1.000   0.179   0.201   0.115   .215
II    0.212   0.236   0.208   0.179   1.000   0.314   0.173   .220
DI    0.217   0.209   0.236   0.201   0.314   1.000   0.270   .241
DD    0.125   0.172   0.155   0.115   0.173   0.270   1.000   .168

Version - All Relevant

AA    1.000   0.141   0.215   0.235   0.167   0.186   0.112   .176
TT    0.141   1.000   0.154   0.133   0.173   0.172   0.150   .154
TA    0.215   0.154   1.000   0.245   0.167   0.173   0.114   .178
ST    0.235   0.133   0.245   1.000   0.138   0.137   0.081   .161
II    0.167   0.173   0.167   0.138   1.000   0.242   0.138   .171
DI    0.186   0.172   0.173   0.137   0.242   1.000   0.258   .195
DD    0.112   0.150   0.114   0.081   0.138   0.258   1.000   .142

Version - All Documents

AA    1.000   0.064   0.148   0.138   0.112   0.103   0.046   .102
TT    0.064   1.000   0.072   0.057   0.086   0.080   0.068   .071
TA    0.148   0.072   1.000   0.156   0.096   0.092   0.052   .103
ST    0.138   0.057   0.156   1.000   0.077   0.063   0.033   .087
II    0.112   0.086   0.096   0.077   1.000   0.131   0.063   .094
DI    0.103   0.080   0.092   0.063   0.131   1.000   0.120   .098
DD    0.046   0.068   0.052   0.033   0.063   0.120   1.000   .064
                              TABLE 6

                   Asymmetric Pairwise Overlaps*

        AA      TT      TA      ST      II      DI      DD      AVG

Version - Most Relevant

AA    1.000   0.329   0.401   0.496   0.340   0.368   0.266   0.367
TT    0.286   1.000   0.328   0.293   0.348   0.332   0.323   0.318
TA    0.451   0.424   1.000   0.520   0.355   0.420   0.344   0.419
ST    0.459   0.312   0.428   1.000   0.284   0.332   0.234   0.341
II    0.361   0.424   0.334   0.325   1.000   0.508   0.365   0.386
DI    0.346   0.359   0.351   0.337   0.450   1.000   0.490   0.389
DD    0.192   0.268   0.221   0.183   0.248   0.376   1.000   0.248
AVG   0.349   0.353   0.344   0.359   0.338   0.389   0.337

Version - All Relevant

AA    1.000   0.276   0.348   0.361   0.275   0.323   0.233   0.306
TT    0.223   1.000   0.237   0.212   0.258   0.274   0.268   0.245
TA    0.361   0.304   1.000   0.402   0.281   0.310   0.241   0.316
ST    0.379   0.261   0.385   1.000   0.233   0.247   0.172   0.279
II    0.297   0.344   0.292   0.254   1.000   0.418   0.292   0.316
DI    0.305   0.319   0.283   0.235   0.366   1.000   0.458   0.328
DD    0.178   0.253   0.178   0.132   0.207   0.370   1.000   0.220
AVG   0.291   0.293   0.287   0.269   0.270   0.324   0.277

Version - All Documents

AA    1.000   0.145   0.250   0.229   0.210   0.193   0.103   0.188
TT    0.103   1.000   0.113   0.088   0.140   0.131   0.123   0.116
TA    0.265   0.169   1.000   0.262   0.188   0.180   0.119   0.197
ST    0.259   0.141   0.279   1.000   0.159   0.131   0.080   0.175
II    0.193   0.182   0.163   0.129   1.000   0.230   0.131   0.171
DI    0.180   0.172   0.158   0.108   0.233   1.000   0.240   0.182
DD    0.078   0.131   0.085   0.053   0.108   0.194   1.000   0.108
AVG   0.180   0.157   0.175   0.145   0.173   0.177   0.133

*The representations in the columns form the denominator of
 the overlap measure.
                              TABLE 7
                      Union Pairwise Overlaps

        AA      TT      TA      ST      II      DI      DD      AVG

Version - Most Relevant

AA    0.328   0.520   0.549   0.481   0.558   0.523   0.502   0.495
TT    0.520   0.285   0.533   0.500   0.512   0.491   0.446   0.470
TA    0.549   0.533   0.369   0.515   0.594   0.548   0.523   0.519
ST    0.481   0.500   0.515   0.304   0.553   0.510   0.485   0.478
II    0.558   0.512   0.594   0.553   0.348   0.500   0.499   0.509
DI    0.523   0.491   0.548   0.510   0.500   0.309   0.430   0.473
DD    0.502   0.446   0.523   0.485   0.499   0.430   0.237   0.446

Version - All Relevant

AA    0.283   0.449   0.475   0.457   0.505   0.465   0.449
TT    0.449   0.229   0.453   0.451   0.456   0.424   0.388
TA    0.475   0.453   0.294   0.462   0.514   0.479   0.458
ST    0.457   0.451   0.462   0.281   0.516   0.483   0.461
II    0.505   0.456   0.514   0.516   0.306   0.462   0.459
DI    0.465   0.424   0.479   0.483   0.462   0.268   0.385
DD    0.449   0.388   0.458   0.461   0.459   0.385   0.216

Version - All Documents

AA    0.220   0.353   0.395   0.412   0.380   0.386   0.369   0.359
TT    0.353   0.156   0.363   0.384   0.331   0.335   0.302   0.318
TA    0.395   0.363   0.234   0.418   0.398   0.402   0.380   0.370
ST    0.412   0.384   0.418   0.249   0.420   0.428   0.402   0.388
II    0.380   0.331   0.398   0.420   0.203   0.361   0.347   0.349
DI    0.386   0.335   0.402   0.428   0.361   0.206   0.332   0.350
DD    0.369   0.302   0.380   0.402   0.347   0.332   0.166   0.329
is, if a retrieval environment were limited to a single
representation, which one would it be? If a second could be
added, which of the remaining six representations contributes
the most over and above the effect of the first
representation? A third representation could be added over
and above the first two, and then a fourth representation,
and so on.



The most sensible measure to use in answering this
question is the union overlap. Tables 8 and 9 present the
results of this analysis. Table 8 uses all seven
representations and analyzes both the highly relevant and
the total relevant measures across queries. Since
three representations (TA, DI, ST) are composed of other
representations, the analysis was repeated in Table 9
omitting these "compound" representations.
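
The ordering procedure described above can be illustrated with a short sketch. It assumes a greedy rule: at each step, add the representation that brings in the most relevant documents not already found. The function name, the alphabetical tie-break and the toy document sets are hypothetical and are not taken from the study data.

```python
from typing import Dict, List, Set, Tuple


def order_by_incremental_improvement(
    retrieved: Dict[str, Set[str]],
) -> List[Tuple[str, int, float]]:
    """Greedy ordering: at each step add the representation that contributes
    the most relevant documents not yet covered.

    Returns (representation, cumulative documents, cumulative fraction).
    """
    total = set().union(*retrieved.values())
    covered: Set[str] = set()
    remaining = dict(retrieved)
    ordering = []
    while remaining:
        # Assumption: ties are broken alphabetically so the result is deterministic.
        best = max(sorted(remaining), key=lambda rep: len(remaining[rep] - covered))
        covered |= remaining.pop(best)
        ordering.append((best, len(covered), len(covered) / len(total)))
    return ordering


# Hypothetical toy data: relevant documents retrieved under each representation.
toy = {
    "TT": {"d1", "d2"},
    "AA": {"d2", "d3", "d4"},
    "DD": {"d5"},
    "II": {"d1", "d4", "d6", "d7"},
}
result = order_by_incremental_improvement(toy)
for rep, n_docs, cum in result:
    print(rep, n_docs, round(cum, 3))
```

In this toy case the cumulative fraction plays the role of the "Cum. Percentage" rows of Tables 8 and 9: the number of relevant documents found by the representations entered so far, divided by the number found by all representations together.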



Tables 8 and 9 present four di[...]
different orderings of representations [...]
consistent, would allow a searcher [...]
combinations of fields would be most [...]
relevant documents. Such models would al[...]
economies in the design and operation of [...]
Unfortunately, these data suggest that th[...]
consistent. What appears to be highly [...]
is the cumulative increase in the percentage of relevant
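TABLE 8

Representations Ordered by Incremental Improvement

[Much of this table is illegible in the source copy. Recoverable
entries are shown; "--" marks entries that could not be read.]

Version - Most Relevant

Order              1st    2nd    3rd    4th    5th    6th    7th
Representation     TA     --     --     --     --     --     --
No. of Documents   299    --     --     --     722    768    810
Cum. Percentage    --     --     --     --     .891   .948   1.000

Version - All Relevant

Order              1st    2nd    3rd    4th    5th    6th    7th
Representation     --     --     --     --     --     --     --
No. of Documents   --     889    1118   1318   --     --     --
Cum. Percentage    --     .516   .649   .765   .850   .930   1.00

[Representation labels legible elsewhere in the table, whose
positions could not be determined: ST, DI, TA, ST, DI, TT, AA.]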
TABLE 9

Representations Ordered by Incremental Improvement*

Version - Most Relevant

Order              1st    2nd    3rd    4th
Representation     II     AA     TT     DD
No. of Documents   282    452    554    634
Cum. Percentage    .348   .558   .684   .783

Version - All Relevant

Order              1st    2nd    3rd    4th
Representation     II     AA     DD     TT
No. of Documents   527    870    1093   1275
Cum. Percentage    .306   .505   .634   .740

*Compound representations omitted.
documents accounted for as each additional representation is
included. This similarity may simply be due to the fact
that the four models are based on highly interrelated data;
that is, data that are subsets of one another. When the
cumulative percentages are plotted against the order, the
resulting curves appear to be Zipfian in form and, when
broken down according to Bradford's law of scatter, the
obtained proportions are 1:3:7. The theoretical proportions
could easily be in the form 1:3:9, but no attempt was made
to verify this analytically.

An ancillary question is that of unique contribution of
the different representations. That is, for a given
representation, what documents does it contribute to the
relevant retrieved that were not retrieved under any other
representation? The question is equivalent to the observed
improvements in the models when the representation is the
last entered into the model. Tables 10 and 11 report
incremental improvement for each representation, assuming
the representation entered the model first or last. These
are the maximum and minimum incremental improvements for
each representation. Again, the index phrase is
distinctively unique, but more so under the full model than
under the restricted one. Table 11 shows AA's unique
contribution to be equivalent to II when the overlaps with
the compound field (of which AA was a part) are not included
in the model. These systematic differences in incremental
improvement suggest that the patterns of overlap may be


TABLE 10
Recalls and Unique Contributions
of 7 Representations

             Entered 1st*            Entered Last*
Reps.     No. of Docs     %       No. of Docs     %

Version - Most Relevant

AA            266       .328          49        .060
DD            192       .237          44        .054
DI            250       .309          42        .052
II            282       .348          74        .091
ST            246       .303          44        .054
TA            299       .369          53        .065
TT            231       .285          52        .064
Total                                           .440

Version - All Relevant

AA            488       .283         137        .080
DD            373       .216         127        .074
DI            462       .268         120        .070
II            527       .306         196        .114
ST            485       .281         149        .086
TA            506       .294         134        .078
TT            395       .229         133        .077
Total                                           .579

*Entered 1st is the equivalent of recall-1 across
queries when no overlap is taken into account.
Entered last are the unique documents found
only by that representation.
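
The two quantities defined in the Table 10 footnote can be sketched in a few lines, assuming each representation's relevant retrieved documents are available as a set. The function name and the toy document identifiers are hypothetical and are not study data.

```python
from typing import Dict, Set, Tuple


def first_and_last_contributions(
    retrieved: Dict[str, Set[str]],
) -> Dict[str, Tuple[int, int]]:
    """For each representation return (entered first, entered last).

    Entered first: every relevant document the representation retrieved.
    Entered last: only those documents no other representation retrieved.
    """
    out = {}
    for rep, docs in retrieved.items():
        others = set().union(*(d for r, d in retrieved.items() if r != rep))
        out[rep] = (len(docs), len(docs - others))
    return out


# Hypothetical toy data: relevant documents retrieved under each representation.
toy = {
    "TT": {"d1", "d2"},
    "AA": {"d2", "d3", "d4"},
    "DD": {"d5"},
    "II": {"d1", "d4", "d6", "d7"},
}
contributions = first_and_last_contributions(toy)
print(contributions)
```

Dividing each count by the size of the union of all the sets would give figures of the same kind as the "%" columns of Table 10.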
TABLE 11

Unique Contributions of 4 Representations*

          Version - Most Relevant      Version - All Relevant
Rep.      No. of Docs      %           No. of Docs      %

AA            125        .196              269        .210
DD             85        .133              197        .154
II            114        .178              271        .213
TT             88        .138              182        .143

*Recalls on 1st entered are same as in TABLE 10.
Compound representations excluded.
representation-specific. It should be noted, though, that
the best unique contributor, II, in the full model retrieved
only 20% (i.e. 0.091/.44) of the uniquely found documents
and performed at the .35 recall level. Table 10 also
reports the sum of the unique percentages, 44% for the rel-1
measure, 58% for rel-2. In other words only 56% and 42% of
the documents were overlapped; another indication of the
low probability of overlap observed in this and other
studies.
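
The percentages quoted in this paragraph follow directly from the totals in Table 10. The short check below reproduces the arithmetic; the variable names are illustrative only.

```python
# Figures taken from Table 10 ("Entered Last" percentage column).
unique_share_most_relevant = 0.440  # sum of unique percentages, rel-1
unique_share_all_relevant = 0.579   # sum of unique percentages, rel-2
ii_unique_most_relevant = 0.091     # II entered last, most relevant

# Share of all uniquely found documents that II alone accounts for.
ii_fraction_of_unique = ii_unique_most_relevant / unique_share_most_relevant

# Documents found by more than one representation (the complement).
overlapped_most = 1 - unique_share_most_relevant
overlapped_all = 1 - unique_share_all_relevant

print(round(ii_fraction_of_unique, 2))  # about 0.21, reported as 20%
print(round(overlapped_most, 2))        # 0.56
print(round(overlapped_all, 2))         # 0.42
```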

Lastly, it is important to restate the difficulty of
clearly interpreting the overlap measures. As previously
mentioned, representations may be confounded with searchers.
VI. PHASE II PLANS

The second phase of the representation project is
designed to 1) replicate the observations and findings of
the first phase, 2) develop models that account for the
results of the first phase and 3) test these in the
experimental environment of the second phase. This section
describes anticipated changes and extensions of the study
methodology that will be incorporated in the second phase.



1. Data Base: The data base for the second phase will be a
portion of the 1980 PsycInfo data base produced by the
American Psychological Association; the printed counterpart
is Psychological Abstracts. 12,000 records will again be
used; dissertations will be excluded from the loaded data
base. PsycInfo was selected as a "soft" data base with a
different user population, in order to test the
generalizability of the INSPEC study results. Additionally,
PsycInfo records contain the same four fields that
constituted the representations: descriptors, title,
abstract and a free text index phrase. A user population
for PsycInfo and searchers experienced with the data base
are readily available. The DIATOM programs will again be
used.

2. Research Design: The latin square design controlled for
searcher differences on the performance dependent variables,
but not on the overlaps. A different research design will
be used in order to obtain estimates of overlap attributable
to (1) representations and (2) searchers.

In order to obtain searches on the same query and the
same representation for all searchers, the number of levels
of representations and searchers probably will be reduced;
the four primary representations will be maintained (title,
abstract, index phrase and descriptors) and four searchers will
be used to obtain a balanced design.

3. Procedures: Procedures will parallel those of the first
phase, revised to meet the requirements of the research
design. This will be achieved by using some form of
completely crossed factorial design.



4. Models: A major activity of Phase II will be the
development and analysis of models that account for the
observed findings. Our current interest is in probabilistic
models: by chance alone, what are the minimum and maximum
overlaps among representations that could be expected for a
given data base? For the minimum overlaps we can proceed by
assuming complete independence of representations; by
using the relative frequency of each representation, we can
determine the probability that random samples of two
representations will contain documents in common.
The maximum overlaps can be calculated from an analysis
of the number of unique words (types) in each
representation. For example, in a sample of 1500 documents
in the INSPEC data base, there are 9674 unique words in the
abstracts (AA), but only 3481 types in the titles (TT).
This lower number clearly puts an upper limit on the overlap
between the two representations. Truncation must be
excluded from consideration in this type of analysis;
otherwise there will not be any real limit on the maximum
possible overlap.
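
A minimal sketch of the chance-overlap idea under complete independence follows. It assumes that each representation's retrieved set behaves like a uniform random sample, drawn without replacement, from the loaded data base; this is one possible reading of the model, not the project's stated method. The function names are hypothetical. The 12,000 records and the 50-citation limit come from this report.

```python
from math import comb


def expected_common(n_docs: int, size_a: int, size_b: int) -> float:
    """Expected number of documents shared by two independent random samples."""
    return size_a * size_b / n_docs


def prob_any_common(n_docs: int, size_a: int, size_b: int) -> float:
    """Probability that two independent random samples share at least one document.

    P(no common document) = C(N - a, b) / C(N, b), the hypergeometric zero term.
    Assumes size_a + size_b <= n_docs.
    """
    return 1 - comb(n_docs - size_a, size_b) / comb(n_docs, size_b)


# Two searches of 50 citations each against a 12,000-record data base.
print(expected_common(12000, 50, 50))
print(prob_any_common(12000, 50, 50))
```

Observed overlaps well above these chance figures would indicate that the representations are not independent. How much dependence to expect is the question the proposed models are meant to address.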

When this analysis is completed, other types of models
need to be explored, particularly models which will
attempt to predict the performance-overlap results of both
phases of this project.



5. Activity: The data in this report will continue to be
analyzed by the project staff and consultants identified in
the proposal. Data collection for hypothesis testing will
go on as the second phase is implemented (e.g. data base
characteristics including distribution of terms in the
representations, and distribution of search technique by
representation and by searcher). Again, the emphasis will
be on representations rather than searchers or searches;
searcher difference will be incorporated only as necessary
to control the variable in the overlap measures.
PROJECT DESCRIPTION                                  Appendix A

This project will examine the relation between the relevance
of retrieved citations and the fields that were searched to
obtain them. Retrieval from seven different document
representations will be studied. A representation consists of
one or two designated search fields.

The data base for the study is Computer and Control Abstracts
(a subfile of INSPEC). The system you will use is a local
simulator of DIALOG, mounted on the S.U. computer. Almost all
DIALOG features are available for you to use, but some restrictions
will be made to achieve the study objectives.

The objectives of the study require you to conduct high
recall searches, but with a limit of no more than 50 citations
per query.

In all, you will be asked to search [number illegible] queries.
Over the course of the study, you will use all seven
representations, but for each query only one representation will
be assigned.

For each query, you will be asked to search from a request
form; the statement of the query was prepared by a real user who
will receive the output. The request form will also prescribe
the representation you are to use. The unique password assigned
to the request will automatically "lock" the search so that you
can only search on the designated parts of the citations.

When you have completed each search (including the
[illegible] print command), return the search request form and a
copy of your interaction with the system to Brian McLaughlin.
SEARCHER'S JOB

Your job as a searcher on this project will be to prepare
and carry out a high recall search for each request using one
of the seven representations as specified.

You will receive the query statement as it was written by
the requestor. This will be the only information you will receive
regarding the user's request since there will be no face-to-face
or telephone negotiations between you and the user.

One of the seven representations will be designated on
the request form. The computer will be restricted to conduct
the search using that representation; therefore your search
strategy should be planned accordingly. You will be given a
thesaurus for controlled vocabulary descriptor searching.

You may perform the search on any terminal that is or can
be connected to Syracuse University and that is convenient for
you, as long as hard copy can be printed. You are to perform a
high-recall search with fifty citations as a maximum. You will
be expected to complete the search within 48 hours after receiving
the request form. Then return (1) the search request form,
filling in the needed information, and (2) a copy of your
interaction with the system.

NOTE: Limit the use of the thesaurus to this study only.
We are legally bound by our contract to this limitation.
DATA BASE                                            Appendix A

Computers and Control Abstracts is that portion of the INSPEC Data
Base dealing with all areas of computing and information science.
The specific data base that will be searched in this study consists
of four months (Sept. - Dec. 1979) of Computer and Control Abstracts.

The citations you will retrieve will be organized as follows:

DN number (abstract numbers from INSPEC journals)

Title

Authors (separated by commas)

Source field, as follows:

    Publication: (volume and issue number) (part number)
    pagination data
    Following this may be information in [ ]. This is
    information on the cover-to-cover translation as
    follows: [publication; (volume and issue) pages
    date] (type of unconventional media) (availability)
    (Title of conference), (location of conference);
    (sponsoring organization) (date) language

Abstract

Indexing information

NOT all the citations will contain each of these items of information.



DIALOG - SIMULATOR DIFFERENCES

The DIALOG simulator you will be using to conduct the searches is
almost identical to "regular" DIALOG. In general, searching should
be performed in the same way as any DIALOG search.

The restrictions, cautions and limitations are noted below.

1. Each new query you search must be started with the full
   begin command.

2. To restrict a search to a particular language, use a
   Limit /ENG (for English), or whatever language you wish.

3. Adjacency (nW) cannot be used with either truncation or
   stemming.

4. Adjacency may run very slowly; the field operator (F) can
   be used instead.
THE REPRESENTATIONS                                  Appendix A

You will be using seven different representations during the
study. A representation names the one or two fields of the citation
to which your search must be restricted. You will search on only
one representation for any given query. The representation you
are supposed to search on will be designated on the request form
we gave to you. A unique password will be given with each request
and this password will automatically lock the search onto the
assigned representation.

The seven representations and the fields they will search
are as follows:

TT - will search terms in title only.

AA - will search terms in abstract only.

DD - will search descriptor terms only. A thesaurus will
     be provided to you for use with this controlled
     vocabulary representation. (The thesaurus may only
     be used on this project).

II - will search identifier terms only.

TA - will search terms in title and abstract only.

ST - will search stemmed terms in title and abstract only.
     The computer will automatically take the logical root
     of any entered term. Truncation cannot be used with
     this representation.

DI - will search terms in descriptor and identifier fields.
     The thesaurus will be provided for use with this
     controlled vocabulary representation.

One representation with which you may be unfamiliar is
"stemming" (ST), which will be used with title and abstract words
only. A stemmed term is a word that has been shortened by the
computer to its logical root. This is similar to truncation in
that the stem LIBRAR would retrieve LIBRARY, LIBRARIES,
LIBRARIAN, etc. For truncation, however, the root is determined
by the searcher. For example, if you entered LIBRARY under the
ST representation, the computer would automatically reduce it
to its logical root and LIBRARY, LIBRARIES, LIBRARIAN, LIBRARIANS,
etc. would all be retrieved.

Truncation is not to be used with the stemming representation.
In fact, the simulator will reject any attempts to use truncation
in this representation.
NAME:                                                DATE:

SCHOOL ADDRESS:                                      PHONE:
HOME ADDRESS:                                        PHONE:

                                                     Appendix A

We would like a description of your topic of interest. This
statement should be clear enough so that any person who also knows
about this topic would, on the basis of this statement alone, be
able to pick out citations of interest for you.

Please write your description here:

[handwritten response; not legible in the source copy]

Given your purposes in requesting this search, how many citations
do you want?

About how many citations on your topic do you expect to receive
from this computer search?

YOU MAY FOLD THIS REQUEST FORM IN THIRDS, STAPLE SECURELY, AND
DROP IN CAMPUS MAIL.                                     4/4/80
NAME:                                                DATE:

SCHOOL ADDRESS:                                      PHONE:

HOME ADDRESS:                                        PHONE:

We would like a description of your topic of interest. This
statement should be clear enough so that any person who also knows
about this topic would, on the basis of this statement alone, be
able to pick out citations of interest for you.

Please write your description here:

[handwritten response; not legible in the source copy]

Given your purposes in requesting this search, how many citations
do you want?

About how many citations on your topic do you expect to receive
from this computer search?

YOU MAY FOLD THIS REQUEST FORM IN THIRDS, STAPLE SECURELY, AND
DROP IN CAMPUS MAIL.                                     4/4/80
NSF INFORMATION RETRIEVAL PROJECT

INSTRUCTIONS TO PARTICIPANTS

Attached you will find a copy of your interest statement and
two copies of a list of references. List (a) is to be used as
part of the study and should be returned after you make your
judgements of relevance. Copy (b) is yours to keep.

Each citation is organized into seven parts:

DN - Document identification number
TI - Title
AU - Author
SO - Source of the citation (i.e. journal title)
AB - Abstract
DT - Date
DE - Descriptors of the citation

Please read each citation and abstract to form an idea of what
that particular document (book, article, report) is about. Compare
this to your interest statement, and for each citation listed,
decide how closely that citation is related to your topic. Based
on the information in front of you, is the citation relevant to
your topic, or not relevant to what you had in mind?

Use the following scale for your judgement:

1 - Definitely relevant to your topic.

2 - Probably relevant to your topic.

3 - Probably not relevant to your topic.

4 - Definitely not relevant to your topic.

Please rate each citation by placing the number corresponding
to your judgement in the box immediately following each citation.
After you have checked all the citations to see whether or not
they are relevant to your interest statement, please return the copy
with the judgements to us in the pre-addressed envelope through
campus mail. If you are not on campus, these envelopes should be
used to return the completed forms to us through the regular mail
service. Thank you for your cooperation.

If you have any questions, please contact us at:

School of Information Studies
Syracuse University
113 Euclid Avenue
Syracuse, New York 13210
SYRACUSE UNIVERSITY                                  Appendix C

SCHOOL OF INFORMATION STUDIES

113 EUCLID AVENUE   SYRACUSE, NEW YORK 13210   PHONE (315) 423-2911

NSF INFORMATION RETRIEVAL PROJECT

We are working on a project which will help us understand
how the pertinence of information retrieved by computer
is related to the method by which it is searched.

For this project, we need information requests which will
be searched in Computer and Computer Control Abstracts (from
October 1979 to January 1980). If you need information in
the area of computers and information science, we will
conduct a search for you free of charge. All you have to
do is submit a search request to us and give us information
on how we did after the search.

For the search request we would like you to describe a
topic of interest to you, one you are working on or are
familiar with, in the computer field. Several days later
you will receive a list of citations that have been retrieved
by computer. You will be asked at that time to indicate
which of these are pertinent to your interest. One copy of
the computer output will be returned to us and the other copy
will be for your own use.

We would very much appreciate your cooperation and
participation in this project. If you are willing to
participate, please read the attached pages and write your
search request in the space provided.

If you do not need a search, please pass this form to
a student.
SYRACUSE UNIVERSITY

SCHOOL OF INFORMATION STUDIES

113 EUCLID AVENUE   SYRACUSE, NEW YORK 13210   PHONE (315) 423-2911

NSF INFORMATION RETRIEVAL PROJECT

As a participant in this project we would like you to submit
a search request (on the attached form) about some aspect of
computers and information science.

We will take your request and search the current issues of
COMPUTER AND COMPUTER CONTROL ABSTRACTS. The results of this
search will be a list of citations to books and journal articles.

We will then give you this list of citations and ask that
you let us know which of these are most pertinent to your search
request.

************

The enclosed form is for you to describe your topic of
interest. If you are planning a talk or doing a paper, you
probably have a topic in mind. If you don't have a topic you are
working on, consider one with which you are familiar. Using this
form, write down your information requirements as if you were
talking to a colleague who understands the field as well as you
do. Don't worry about trying to say it in "computerese"; we have
trained people to make sure that your search is conducted
professionally.

************

Thank you for your cooperation. If you have any questions,
please feel free to contact us.

NSF Information Retrieval Project
School of Information Studies
113 Euclid Avenue
Syracuse, New York 13210
(315) 423-4522

4/4/80
NAME:                                                DATE:

SCHOOL ADDRESS:                                      PHONE:
HOME ADDRESS:                                        PHONE:

                                                     Appendix C

We would like a description of your topic of interest. This
statement should be clear enough so that any person who also knows
about this topic would, on the basis of this statement alone, be
able to pick out citations of interest for you.

Please write your description here:



Given your purposes in requesting this search, how many citations
do you want?

About how many citations on your topic do you expect to receive
from this computer search?

YOU MAY FOLD THIS REQUEST FORM IN THIRDS, STAPLE SECURELY AND
DROP IN CAMPUS MAIL.                                     4/4/80
SEARCH QUERY COVER SHEET - Page 1

Searcher:

Search Query Number:

Date to Searcher:                Representation Code this Query:

Date to be Returned:             DIALOG Password:

Some Important Notes:

1. Each new query to be searched must be started by the full
   BEGIN command.

2. You do not need to LOGOFF after each query before starting the
   next query. You do need to PRINT the documents retrieved
   before typing the BEGIN command for the new query.

3. Truncation cannot be used with the stemming representation (ST);
   it can be used with other representations.

4. Though you can use adjacency, you should know that it may run
   very slowly. Instead, you may choose to use the field operator
   (F). This implementation of DIALOG will not allow the
   use of adjacency with truncation, or adjacency with stemming.

LOGON and LOGOFF

The step-by-step sequence for connecting with the computer, for
conducting a DIALOG search, and for disconnecting from the computer
is given below.

Everything you type at the terminal must be sent to the computer
with a carriage return.

The computer responses to some of these commands are not given here.

1. If you are using a dial-up terminal, the phone number is
   423-1313. Remember, it must be a hard-copy terminal.

2. Turn power on and hit carriage return.

3. Type: LOG 3434^14

4. Type: NSF

5. Type: DO DIALOG

   The computer will ask for your DIALOG password. It is
   given at the top of this page.

Date Returned to                 Date Returned
Brian McLaughlin:                to NSF:



SEARCH QUERY COVER SHEET - Page 2

6. Type: BEGIN

   The computer will ask for the query number and the
   representation code. Both can be found at the top of
   Page 1.

7. Carry out the search for this query.

   Remember, we want a high recall search with a maximum of
   50 documents retrieved.

   Before starting a new query you need to have the set of
   retrieved documents printed. Use the PRINT command; the
   format should always be 1.

8. If you want to search another query, look at the COVER SHEET
   for that query and begin at Step 6.

   If you are completely done searching for now, go to Step 9.

9. Type: LOGOFF

10. Type: K/F

11. Turn power off, collect your materials and submit them to
    Brian McLaughlin.

Submitting Searches

Brian McLaughlin will distribute and collect all searches. When
a search is completed, you need to submit this COVER SHEET and a
copy of your interaction. Queries should be searched and
returned within 48 hours after receiving them.

Help and Assistance

1. Brian McLaughlin
   210 Hubbell Avenue
   Syracuse, New York
   476-7359 (Home)
   423-20^1 (Work)

2. NSF Retrieval Project
   113 Euclid Avenue
   Syracuse, New York
   423-4522
AOV SUMMARY TABLE: Recall-1

                              Sum of              Mean
Source                        Squares     df     Square      F

Between Squares                2.624      11      .239
Queries in Squares            10.415      58      .180
Searchers                      4.072       6      .679
Squares X Searcher             7.940      66      .120
Representations                1.415       6      .236     3.324*
Square X Representation        6.021      66      .091     1.282**
Residual (by subtraction)     19.714     276      .071
Total                         52.201     489

*Region of rejection begins at 2.14 (α = .05) or 2.89 (α = .01)

**Region of rejection begins at 1.12 (α = .25). Since obtained
  value falls within the region of rejection, the square X
  representation source of variation is not pooled into the
  residual.

NOTE 1: Tukey's HSD region of rejection = 4.17
        standard error = .0318

NOTE 2: Missing values in the data (14 queries retrieved no
        highly relevant documents) required a least squares
        solution to the analysis. This approach exceeded
        the limits of the computer. Approximation methods
        were then employed.
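
The F ratios in the Recall-1 table can be checked from the mean squares, as in the sketch below. Reading the Tukey figure of 4.17 as a studentized range critical value, so that multiplying it by the standard error gives a minimum significant difference between two representation means, is an assumption; the table does not say so explicitly. Variable names are illustrative.

```python
# Mean squares from the Recall-1 AOV summary table.
ms_representations = 0.236
ms_square_x_representation = 0.091
ms_residual = 0.071

# Each F ratio is the effect mean square divided by the residual mean square.
f_representations = ms_representations / ms_residual
f_square_x_representation = ms_square_x_representation / ms_residual

# Assumption: 4.17 is the studentized range critical value for Tukey's HSD.
q_critical = 4.17
standard_error = 0.0318
hsd_min_difference = q_critical * standard_error

print(round(f_representations, 3))          # 3.324, as reported
print(round(f_square_x_representation, 3))  # 1.282, as reported
print(round(hsd_min_difference, 3))         # 0.133
```

Under that assumption, two representations would need mean recall-1 values about 0.13 apart to differ significantly.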
AOV SUMMARY TABLE: Recall-2

[The Mean Square column is not legible in the source copy.]

                              Sum of              Mean
Source                        Squares     df     Square      F

Squares                         .963      11       --
Queries in Squares             5.678      65       --
Searchers                      4.088       6       --
Squares X Searchers            4.842      66       --
Representations                1.032       6       --       3.44*
Pooled Error                  19.038     384       --
  (by subtraction)
Total                         35.641     538

*Region of rejection begins at 2.14 (α = .05) or 2.89 (α = .01)

NOTE 1: Tukey's HSD region of rejection = 4.17
        standard error = .0255

NOTE 2: Missing values in the data (7 queries retrieved no
        relevant documents at all) required a least squares
        solution to the analysis. This approach exceeded
        the limits of the computer. Approximation methods
        were then employed.
AOV SUMMARY TABLE: Precision-1

Sources                         SS        df       MS        F

Squares                        3.536      11      .321
Queries in Squares*           15.066      72      .209
Searchers                      0.528       6      .088
Squares by Searchers           3.740      66      .057
Representations                0.219       6      .0365     .829 (n.s.)
Pooled error                  15.829     360      .044
  (by subtraction)
Total                                    521

*Missing values in the data (66 cases with no documents
retrieved) required a least squares solution to the analysis.
This approach exceeded the limits of the computer. Approximation
methods were then employed which resulted in more than
one value for the Queries in Squares sum of squares. The
value given above is the smaller of the two values, which led
to a slightly larger value for the Error sum of squares. The
approach is conservative in the sense that if the effect of
representations were to be significant, it would also be
significant if the other value for the Queries in Squares sum
of squares were used.



AOV SUMMARY TABLE: Precision-2                       Appendix F

Sources                         SS        df       MS        F

Squares                        5.489      11      .499
Queries in Squares*           19.886      72      .276
Searchers                      0.691       6      .115
Squares by Searchers           5.348      66      .081
Representation                 0.364       6      .0607     1.05 (n.s.)
Pooled Error                  20.788     360      .0577
  (by subtraction)
Total                                    521

*Missing values in the data (66 cases with no documents
retrieved) required a least squares solution to the analysis.
This approach exceeded the limits of the computer. Approximation
methods were then employed which resulted in more than
one value for the Queries in Squares sum of squares. The value
given above is the smaller of the two values, which led to a
slightly larger value for the Error sum of squares. The
approach is conservative in the sense that if the effect of
representations were to be significant, it would also be
significant if the other value for the Queries in Squares
sum of squares were used.
AOV SUMMARY TABLE: Tot-Ret.

                              Sums of              Mean
Sources                       Squares      df     Square       F

Between Squares             10688.347      11     971.668
Queries in Squares          40273.878      72     559.359
Searchers                   19316.177       6    3219.363
Squares X Searchers         13719.415      66     270.870
Representations              3654.511       6     609.085    4.24*
Residual                    61236.183     426     143.747
Total                      148888.511     587

*Region of rejection begins at 2.14 (α = .05) or 2.89 (α = .01)

NOTE: Tukey's HSD region of rejection = 4.17;
      standard error = 1.301